

Search for: All records

Creators/Authors contains: "Hu, Michael"


  1. Free, publicly-accessible full text available October 1, 2026
  2. Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when two conditions are met: the formal language should capture the dependency structures present in natural language, and it should remain within the computational limitations of the model architecture. We experiment with pre-pretraining (training on formal language before natural language) on transformers and find that formal languages capturing hierarchical dependencies indeed enable language models to achieve lower loss on natural language and better linguistic generalization compared to other formal languages. We also find modest support for the hypothesis that the formal language should fall within the computational limitations of the architecture. Strikingly, pre-pretraining reduces loss more efficiently than training on a matched amount of natural language. For a 1B-parameter language model trained on roughly 1.6B tokens of natural language, pre-pretraining achieves the same loss and better linguistic generalization with a 33% smaller token budget. Finally, we also give mechanistic evidence of transfer from formal to natural language: attention heads acquired during pre-pretraining remain crucial for the model's performance on syntactic evaluations. 
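The hierarchical formal languages described above can be illustrated with a Dyck (balanced-bracket) language, a standard toy model of nested dependencies. The generator below is a hypothetical sketch for intuition, not the authors' pre-pretraining corpus:

```python
import random

BRACKETS = {"(": ")", "[": "]", "{": "}"}

def sample_dyck(length, rng):
    """Sample a balanced bracket string (Dyck word) of the given even length.

    Nested brackets mimic the hierarchical, long-range dependencies that
    the abstract credits with effective transfer to natural language.
    """
    assert length % 2 == 0, "a balanced string has even length"
    out, stack = [], []
    # len(out) + len(stack) counts emitted tokens plus closers still owed,
    # so the loop stops exactly when the finished string will hit `length`.
    while len(out) + len(stack) < length:
        if stack and rng.random() < 0.5:
            out.append(BRACKETS[stack.pop()])  # close the most recent open
        else:
            b = rng.choice("([{")              # open a new bracket type
            stack.append(b)
            out.append(b)
    while stack:                               # close whatever remains open
        out.append(BRACKETS[stack.pop()])
    return "".join(out)
```

Strings sampled this way can be tokenized into a synthetic corpus and trained on before switching to natural-language text, which is the pre-pretraining setup the abstract describes.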
  3. 57Fe nuclear resonance vibrational spectroscopy (NRVS) is used to study the tetranuclear iron clusters bearing a terminal Fe(iii)–O/OH moiety. The redox states of the three remote basal iron sites modulate the Fe(iii)–O/OH vibrational frequencies. 
  4. Language model performance depends on identifying the optimal mixture of data groups to train on (e.g., law, code, math). Prior work has proposed a diverse set of methods to efficiently learn mixture proportions, ranging from fitting regression models over training runs to dynamically updating proportions throughout training. Surprisingly, we find that no existing method consistently outperforms a simple stratified sampling baseline in terms of average test perplexity. To understand this inconsistency, we unify existing methods into a standard framework, showing they are equivalent to solving a common optimization problem: minimize average loss subject to a method-specific mixing law -- an implicit assumption on the relationship between loss and mixture proportions. This framework suggests that measuring the fidelity of a method's mixing law can offer insights into its performance. Empirically, we find that existing methods set their mixing law parameters inaccurately, resulting in the inconsistent mixing performance we observe. Using this insight, we derive a new online method named Aioli, which directly estimates the mixing law parameters throughout training and uses them to dynamically adjust proportions. Aioli outperforms stratified sampling on 6 out of 6 datasets by an average of 0.27 test perplexity points, whereas existing methods fail to consistently beat stratified sampling, doing up to 6.9 points worse. Moreover, in a practical setting where proportions are learned on shorter runs due to computational constraints, Aioli can dynamically adjust these proportions over the full training run, consistently improving performance over existing methods by up to 12.012 test perplexity points. 
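The contrast between the stratified sampling baseline and dynamic proportion adjustment can be sketched with a toy multiplicative-weights rule. Both functions below are hypothetical illustrations: the update is far simpler than Aioli, which estimates mixing-law parameters rather than reacting to raw losses.

```python
import numpy as np

def stratified_proportions(num_groups):
    """Stratified sampling baseline: equal weight on every data group."""
    return np.full(num_groups, 1.0 / num_groups)

def update_proportions(proportions, group_losses, lr=0.1):
    """Toy dynamic update: multiplicatively upweight higher-loss groups,
    then renormalize so the proportions remain a probability distribution."""
    w = proportions * np.exp(lr * np.asarray(group_losses, dtype=float))
    return w / w.sum()
```

Applied once per evaluation interval during training, a rule like this shifts sampling toward groups the model currently fits worst; the abstract's point is that the quality of the assumed loss-proportion relationship (the mixing law) determines whether such dynamic schemes beat the equal-weight baseline.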
  5. Soares, Cláudio (Ed.)
Abstract Extremophile organisms are known that can metabolize at temperatures down to −25 °C (psychrophiles) and up to 122 °C (hyperthermophiles). Understanding viability under extreme conditions is relevant for human health, biotechnological applications, and our search for life elsewhere in the universe. Information about the stability and dynamics of proteins under environmental extremes is an important factor in this regard. Here we compare the dynamics of small Fe-S proteins – rubredoxins – from psychrophilic and hyperthermophilic microorganisms, using three different nuclear techniques as well as molecular dynamics calculations to quantify motion at the Fe site. The theory of ‘corresponding states’ posits that homologous proteins from different extremophiles have comparable flexibilities at the optimum growth temperatures of their respective organisms. Although ‘corresponding states’ would predict greater flexibility for rubredoxins that operate at low temperatures, we find that from 4 to 300 K, the dynamics of the Fe sites in these homologous proteins are essentially equivalent. 
  6. The impact of randomness on model training is poorly understood. How do differences in data order and initialization actually manifest in the model, such that some training runs outperform others or converge faster? Furthermore, how can we interpret the resulting training dynamics and the phase transitions that characterize different trajectories? To understand the effect of randomness on the dynamics and outcomes of neural network training, we train models multiple times with different random seeds and compute a variety of metrics throughout training, such as the norm, mean, and variance of the neural network's weights. We then fit a hidden Markov model (HMM) over the resulting sequences of metrics. The HMM represents training as a stochastic process of transitions between latent states, providing an intuitive overview of significant changes during training. Using our method, we produce a low-dimensional, discrete representation of training dynamics on grokking tasks, image classification, and masked language modeling. We use the HMM representation to study phase transitions and identify latent "detour" states that slow down convergence. 
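Once an HMM has been fit to the metric sequences, reading off the most likely latent-state trajectory (e.g. to spot "detour" states) is a Viterbi decode. The function below is the textbook dynamic-programming algorithm in NumPy, not the authors' implementation:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely latent state path under an HMM.

    log_pi: (K,) log initial-state probabilities.
    log_A:  (K, K) log transition matrix, rows = previous state.
    log_B:  (T, K) log emission likelihood of each observation per state.
    Returns the decoded state sequence as a list of T ints.
    """
    T, K = log_B.shape
    dp = np.zeros((T, K))              # best log-score ending in each state
    ptr = np.zeros((T, K), dtype=int)  # backpointers to the best predecessor
    dp[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = dp[t - 1][:, None] + log_A  # (prev, cur) candidate scores
        ptr[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + log_B[t]
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):            # follow backpointers
        path.append(int(ptr[t][path[-1]]))
    return path[::-1]
```

Here each observation would be a vector of training metrics (weight norm, mean, variance) at one checkpoint, and each latent state one of the HMM's discrete training phases; transitions in the decoded path mark the phase transitions the abstract studies.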
  7. Abstract Solid–liquid phase transitions are basic physical processes, but atomically resolved microscopy has yet to capture their full dynamics. A new technique is developed for controlling the melting and freezing of self‐assembled molecular structures on a graphene field‐effect transistor (FET) that allows phase‐transition behavior to be imaged using atomically resolved scanning tunneling microscopy. This is achieved by applying electric fields to 2,3,5,6‐tetrafluoro‐7,7,8,8‐tetracyanoquinodimethane‐decorated FETs to induce reversible transitions between molecular solid and liquid phases at the FET surface. Nonequilibrium melting dynamics are visualized by rapidly heating the graphene substrate with an electrical current and imaging the resulting evolution toward new 2D equilibrium states. An analytical model is developed that explains observed mixed‐state phases based on spectroscopic measurement of solid and liquid molecular energy levels. The observed nonequilibrium melting dynamics are consistent with Monte Carlo simulations. 
  8. Abstract Isotopic fractionation has been linked to the lattice vibrations of materials through their phonon spectra. The Lamb-Mössbauer factor (fLM) has the potential to provide information about the lattice vibrations in materials. We constrain the temperature evolution of the fLM of γ- and ε-Fe at in situ high-P-T conditions between 1650 K and the melting point. We find that the vibrations of γ- and ε-Fe can be described using a quasiharmonic model with a pressure- and temperature-dependent Debye temperature computed from the measured fLM. From the Debye temperature, we derive the equilibrium isotopic fractionation β-factor of iron. Our results show that the quasiharmonic behavior of metallic iron would lower the value of ln β(⁵⁷Fe/⁵⁴Fe) by 0.1‰ at 1600–2800 K and 50 GPa when compared to the extrapolation of room-temperature nuclear resonant inelastic X-ray scattering data. Our study suggests that anharmonicity may be more prevalent in Fe metal than in lower-mantle minerals at 2800 K and 50 GPa, a condition relevant to core formation, and that the silicate mantle may be isotopically heavy in iron. 
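The step from a measured fLM to a Debye temperature in the abstract above follows the standard Debye-model expression for the Lamb-Mössbauer factor (a textbook relation, not the paper's specific parameterization):

```latex
f_{\mathrm{LM}} = \exp\!\left(-k^{2}\,\langle x^{2}\rangle\right), \qquad
\langle x^{2}\rangle = \frac{3\hbar^{2}}{M k_{B}\,\Theta_{D}}
\left[\frac{1}{4} + \left(\frac{T}{\Theta_{D}}\right)^{2}
\int_{0}^{\Theta_{D}/T} \frac{t\,\mathrm{d}t}{e^{t}-1}\right]
```

Here k is the resonant X-ray wavevector, M the ⁵⁷Fe nuclear mass, and Θ_D the Debye temperature; inverting the measured fLM at a known T yields Θ_D(P, T), from which the equilibrium β-factor can then be computed.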